[From: Cream of the Crop 26.iso / editor / dedupe12.zip / TECH.DOC (Text File, 1997-05-12, 4KB, 60 lines)]
"DeDupe" Technical Information:
Before I begin, "Source" actually means the Reference. "Source Block", "Block2",
"Source Line" (the Reference Line), and "Line2" (the Line that the Reference
Line is Compared To) are All part of the Same File, which is Opened Twice:
once as the "Source" File, and again as "File2".
I thought the Best Way to Remove Duplicate Lines Located Anywhere would be
to "Read" (Load) the File by Blocks. "Source" Line 1 is Compared to Line 2,
then to Line 3, ... then to the Last Line of the Current Block. Next,
"Source" Line 2 is Compared to Line 3, then to Line 4, ... then to the Last
Line of the Current Block. Next, "Source" Line 3, etc. The "Source" Line is
Always Lower than "Line2" (the Line Compared To). When all the "Source"
Lines of the Current "Source" Block are Finished, Load the Next "Block2" and
Reset "Source" Block back to Line 1 again. Now Compare "Source" Line 1 to
the first Line in the next "Block2" (the next Block of the File). When
"Block2" is the Last Block of the File ("File2"), and all the "Source" Block
Lines have been Compared to All the Lines in the Last "Block2", ReStart
"File2" again and Advance "Source" Block to the Next Block of the File.
Note: Until now, the "Source" Block has been the First Block of the File.
Advance the ReStarted "File2" Line by Line until "Line2" is Past the Current
"Source" Line (which is Now in the Next Block of the File), then resume
Comparing Lines.
    This Process Ends when the "Source" Block reaches the Last Block of the
File and All "Source" Lines have been Compared, except the Last Line (a Line
is Never Compared to Itself, and No Lines Follow it).
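The original program was a DOS executable, not Python, and its Block size and variable names are not given; the sketch below is my reconstruction of the Block-by-Block Compare described above, with an in-memory list standing in for the File and a Boolean list standing in for the "Mark" Segment.

```python
def mark_duplicates(lines, block_size=4):
    """Compare every Line to every later Line, one "Source" Block
    against one "Block2" at a time, marking the later copy of each
    duplicate. The "Source" Line index is always lower than "Line2"."""
    n = len(lines)
    marks = [False] * n                       # one "bit" per Line
    for src_start in range(0, n, block_size):            # "Source" Block
        src_block = lines[src_start:src_start + block_size]
        for b2_start in range(src_start, n, block_size): # "Block2"
            block2 = lines[b2_start:b2_start + block_size]
            for i, src_line in enumerate(src_block):
                src_idx = src_start + i
                for j, line2 in enumerate(block2):
                    idx2 = b2_start + j
                    if idx2 <= src_idx:      # "Source" is always lower
                        continue
                    if line2 == src_line:
                        marks[idx2] = True   # mark the later duplicate
    return marks

# With block_size=2, five Lines are scanned two Blocks at a time:
print(mark_duplicates(["a", "b", "a", "c", "b"], block_size=2))
# → [False, False, True, False, True]
```

Only the later copy of each duplicate is ever marked, so the first occurrence of every Line survives, which matches the behavior the rest of this document describes.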
But wait, it's Not Over yet! Along the way, the Lines Compared that
Match (Duplicate) are Marked (Setting a Bit) in a "Mark" Segment in Computer
Memory. One final Pass Reads the entire File again, using a Line Counter to
"Index" each Line's Related Bit in the "Mark" Segment, in order to determine
whether that Line is a Duplicate or Not.
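The "Mark" Segment packs one Bit per Line. A minimal sketch of that bookkeeping and the final Pass, assuming a plain byte buffer with 8 Bits per byte (the original segment-register details are DOS-specific and not shown here):

```python
def set_mark(mark_seg, line_no):
    """Set the Bit for a given Line number (0-based) in the "Mark" Segment."""
    mark_seg[line_no // 8] |= 1 << (line_no % 8)

def is_marked(mark_seg, line_no):
    """Test the Bit for a given Line number."""
    return bool(mark_seg[line_no // 8] & (1 << (line_no % 8)))

def final_pass(lines, mark_seg):
    """Re-read the File with a Line Counter, keeping only un-Marked Lines."""
    return [ln for i, ln in enumerate(lines) if not is_marked(mark_seg, i)]

lines = ["a", "b", "a", "c", "b"]
mark_seg = bytearray((len(lines) + 7) // 8)  # one Bit per Line, rounded up
set_mark(mark_seg, 2)                        # Line 3 is a duplicate of Line 1
set_mark(mark_seg, 4)                        # Line 5 is a duplicate of Line 2
print(final_pass(lines, mark_seg))
# → ['a', 'b', 'c']
```

One Bit per Line keeps the Segment tiny: a 30,000-Line File needs only 3,750 bytes of "Mark" Memory.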
This Process made the Project Complex, but the Alternative was to Limit
the Size of the File from Small to Medium size (whatever can Fit into
available Memory), or to Load the First Line and Compare it to Every Line
(from Line 2) to the End of the File; Next, Line 2 to Every Line (from
Line 3) to the End of the File; and so on. This would have made my "Project" Easier,
but for a 30,000 Line File, the File would have to be Read (Loaded) 30,000
times, which is a good way to Shorten the Life of your Hard Drive.
WHY IT TAKES "TIME":
I did Not know, when I started this Project, how many Line Compares would
take place. Test Procedures added to the Program, while it was under
Development, included a "Cycle Counter", which Counted every time a Line
Comparison took Place. I never imagined the Astronomical Number of Times
Lines would be Compared in a Large File. A 301 Line File (Small File)
required a Total of 45,150 Compare Cycles. If you Add just 10 Lines to that
File, it Jumps to 48,205 Cycles, an Increase of 3,055 Cycles. If you Add
another 10 Lines (Now 321 Lines), the Total Cycles Increased to 51,360, an
additional increase of 3,155. Notice the differences also Increased (Non-
Linear). Now Imagine a 20,000 Line File! This is the Reason it took about
12 minutes on my 75MHz Pentium PC. Note: Two different 20,000 Line Files
will Not take the Same amount of Time. It depends on the number of
Characters in the Lines, and how far into each Line the Compare gets before a
Mis-Match occurs (at which point it Moves On to the next Line).
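The Cycle counts above follow the triangular-number formula n*(n-1)/2, since each Line is Compared once to every later Line. The document never states the formula, so this is my reconstruction, but it reproduces every figure given:

```python
def compare_cycles(n):
    """Total Line-Compare Cycles for an n-Line File: each of the n Lines
    is Compared to every later Line exactly once, i.e. n*(n-1)/2."""
    return n * (n - 1) // 2

print(compare_cycles(301))                        # → 45150
print(compare_cycles(311))                        # → 48205
print(compare_cycles(321))                        # → 51360
print(compare_cycles(311) - compare_cycles(301))  # → 3055
print(compare_cycles(20000))                      # → 199990000
```

A 20,000-Line File works out to nearly 200 Million Compare Cycles, which is why the run time grows so much faster than the File size.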